# Vision Transformer architecture
- **Sapiens Seg 0.6b Bfloat16** (facebook) · Image Segmentation · English · 24 downloads · 0 likes
  Sapiens is a family of Vision Transformer models pre-trained on 300 million 1024×1024-resolution human images, focused on human-centric vision tasks.
- **Sapiens Seg 0.3b** (facebook) · Image Segmentation · English · 48 downloads · 2 likes
  A 0.3B-parameter member of the same Sapiens family of human-centric Vision Transformer models.
- **Vit Base Patch32 224 In21k** (google) · Apache-2.0 · Image Classification · 35.10k downloads · 19 likes
  A Vision Transformer (ViT) pretrained on the ImageNet-21k dataset at 224×224 resolution, suitable for image classification tasks.
- **Dpt Large Ade** (Intel) · Apache-2.0 · Image Segmentation · Transformers · 3,497 downloads · 8 likes
  A Dense Prediction Transformer (DPT) fine-tuned on the ADE20k dataset for semantic segmentation tasks.
- **Dpt Large** (Intel) · Apache-2.0 · 3D Vision · Transformers · 364.62k downloads · 187 likes
  A ViT-based monocular depth estimation model trained on 1.4 million images, suitable for zero-shot depth prediction.
- **Beit Large Finetuned Ade 640 640** (microsoft) · Apache-2.0 · Image Segmentation · Transformers · 14.97k downloads · 14 likes
  A Vision Transformer-based segmentation model that combines self-supervised pre-training (BEiT) with fine-tuning on ADE20k for efficient semantic segmentation.
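As a quick way to try one of the models above, the most-downloaded entry (Dpt Large) can be run through the Hugging Face `transformers` depth-estimation pipeline. This is a minimal sketch: the model id `Intel/dpt-large` is an assumption inferred from the listing (org Intel, model Dpt Large), and the first call downloads the weights, so network access is required.

```python
# Minimal sketch: zero-shot monocular depth estimation with DPT-Large.
# Assumes the model id "Intel/dpt-large" matches the "Dpt Large" entry
# above; requires `pip install transformers torch pillow`.
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

# Any RGB image works; a synthetic one is used here for illustration.
image = Image.new("RGB", (384, 384), color=(128, 128, 128))
result = depth_estimator(image)

# result["depth"] is a PIL depth map; result["predicted_depth"] is the
# raw model output as a tensor.
print(result["predicted_depth"].shape)
```

The segmentation models in the list (Dpt Large Ade, Beit Large Finetuned Ade 640 640) can be loaded the same way by switching the pipeline task to `"image-segmentation"`.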